library(ggplot2)
library(tidyverse)
library(dplyr)
library(nycflights13)
library(plotly)
library(maps)
library(geosphere)
library(ggpubr)
options(scipen=999)
library(reshape2)

Question 1.

Section 1. - What questions / stories the graphic is trying to answer?

Plot A:

The plot is to “mapping” the precentage of late flights (which is defined by 15 minutes or more delay in departure), based on the destination of arrival (airport).
Questions/stories the graph tries to answer:
i.
(1) Does the flight’s distance affecting the probability of a flight to be delayed?
(2) Does it answer succesfully? In our opinion it doesn’t, since the legend isn’t intuitive enough, there isn’t a clear image which “pops” to your eye. Moreover, the lines overlap which hardens the examination.
ii.
(1) Is there a geographic trend? (does a flight to the west more probable to delay than a flight to the east?)
(2) Does it answer succesfully? Partly, we can’t determine by looking at the graph whether there is such trend, and if one does exist, then the graph doesn’t hold.
iii.
(1) Do different airports in the same state have different preformance in terms of chance to be late?
(2) Does it answer succesfully? Partly, since the lines overlap, the nearby destenations hidden, therefore we can’t conclude almost anything actually, but regarding long distance destinations, the graph suffices an answer.
(3) Questions that rised from the plot which are not addressed :
1. What is the traffic in each flight lane?(i.e how many flights between the two airports?) 2. How busy/crowded the destination airport is?

  1. A way to improve:
    Instad of using overlapping lines to connect between the origin airport (despite the partly loss of geographical distance input) and the destination, it is better to use colorful points that mark the destination airport. This way the reader of the map can easly see the flights that are closer to the origin airport.Also, in this way we can add more information to the plot(for example the size of the point will determent the number of flight that went from the origion airport to the destination)

Plot B:

The plot is trying to find a link between the amount of flights per period of time and the amount of delayed, and to declare whether there is a high correlation between the weekly cycle and higher amount of delayed flights (do they have the same cycle). Questions/stories the graph tries to answer:
i.
(1) What kind of events affect the relation above? (for example holidays, economic state, wheather and cetera)
(2) Does it answer succesfully? Partly, it does specify the main events but doesn’t provide further information (more minor events for example) ii.
(1) What is the relation between the number of total flights and the number of late flights?
(2) Does it answer succesfully? We believe that the plot shows that in some periods the relation is not direct, while in other periods it is. Therefore it does answer the question, the answer is that the relation differentiate throught the year.
(3) Questions that rised from the plot which are not addressed:
1. Does the cycle of the total flight matches the cycle of specific destination (i.e if we look at the cycle of outside) 2. infromative flaws: from which airport/airports the data was taken?, distinction between domestic flights and international flights?

  1. A way to improve:
    We could plot the data in a bar-plot, where each bar contains the total number of flights (in blue), and part of the bar stands for the late flights number, this way the relation is more direct (to visualize - a bar which is partly blue [total - late] and partly red [bottom - late])

Question 2

(1) A graphic summarizing the percent of flights delayed, broken by destination Airport.

nyfly <- flights
airport <- airports %>% select(faa,lat,lon) %>%  filter(faa %in% c(nyfly$origin,nyfly$dest))
colnames(airport)[1]="origin"
join1 <-left_join(nyfly,airport,by="origin")
colnames(airport)[1]="dest"
join2 <-left_join(join1,airport,by="dest")
join2 <- na.omit(join2)
join2 <-join2[which(join2$origin=="JFK"),]
colnames(join2)[20:23] <-c("origin_lat","origin_lon","dest_lat","dest_lon")
join2 <- join2 %>% mutate(late = ifelse(dep_delay>15,1,0))
join2 <- join2 %>% mutate(tot = 1)
latepercent <- join2 %>% group_by(dest) %>% summarise(late=sum(late)/sum(tot))
join2 <-join2 %>% select(-late,-tot)
desti <- join2 %>% distinct(dest,.keep_all = T) %>% left_join(latepercent,by="dest")
desti <- desti %>% mutate(color = ifelse(desti$late<=0.1,"green",ifelse(desti$late>0.1 & desti$late <=0.15,"purple",ifelse(desti$late>0.15 & desti$late<=0.20,"blue",ifelse(desti$late>0.20 & desti$late<=0.25,"orange","red")))))
MainStates <- map_data("state")

airport2 <-airports %>% select(faa,name,lat,lon) %>%  filter(faa %in% c(nyfly$origin,nyfly$dest))
fig <- ggplot() + 
  geom_polygon( data=MainStates, aes(x=long, y=lat, group=group),
                color="black",fill="white") + geom_point(aes(x=airport2$lon,y=airport2$lat,text=airport2$name),shape=1) + geom_segment(data=desti,aes(x =origin_lon, y = origin_lat, xend = dest_lon, yend =dest_lat,col=ifelse(desti$late<=0.1,"<10%",ifelse(desti$late>0.1 & desti$late <=0.15,"10%-15%",ifelse(desti$late>0.15 & desti$late<=0.20,"15%-20%",ifelse(desti$late>0.20 & desti$late<=0.25,"20%-25%","25%>"))))),alpha=0.8,size=0.8) +annotate("text",x=state.center$x,y=state.center$y,label=ifelse(state.abb=="HI"|state.abb=="AK","",state.abb),size=2.5) +scale_color_manual(name = "%", labels = c('<10%',"10%-15%","15%-20%","20%-25%",">25%"), values = c("green","purple","blue","orange","red"))+ ggtitle("Percent of Flights Departurs Delayed Over 15 Minutes ",subtitle = "Airport=JFK    Year=2013") + coord_cartesian(xlim=c(-130, -65), ylim = c(25, 50)) +theme_bw()+ theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.title.y=element_blank(),axis.text.y =element_blank(),axis.ticks.y = element_blank(),plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5),legend.title = element_text(hjust = 0.5),legend.position = "left")

ggplotly(fig,tooltip = c("text")) %>% layout(legend = list(x = 0, y = 0.2,size=12),title=list(text = paste0("Percent of Flights Departurs Delayed Over 15 Minutes","<br>","<sup>","Airport=JFK    Year=2013","</sup>")))

(1) A graphic summarizing the flight volume and flights delayed, broken by day and showing weekly cycles.

join3 <- nyfly %>% mutate(late = ifelse(dep_delay>15,1,0))
join3 <- join3 %>% mutate(tot = 1)
cyc <- na.omit(join3) 
cyc <- cyc %>% group_by(month,day) %>% summarise(tot =sum(tot),late=sum(late)) %>% ungroup() %>% mutate(dates = seq(as.Date("2013-1-1"), as.Date("2013-12-31"), by="days"))
Sys.setlocale(category = "LC_ALL", locale = "english")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
cyc %>% ggplot(aes(x=dates,y=tot)) + geom_point(aes(x=dates,y=tot,color="Total Flights"),size=3)+ scale_color_manual(values = c("red","blue")) + geom_line(col="blue") + geom_point(aes(x = dates, y = late,color="Late Flights"), size = 3) + geom_point(aes(x=as.Date("2013-02-09"),y=min(tot)),size=15,col=" dark blue")+ geom_segment(aes(x = as.Date("2013-01-09"), y = 475, xend =as.Date("2013-02-06"), yend = 300),arrow = arrow(length = unit(0.5, "cm")),size=2)+geom_segment(aes(x=dates,xend=dates,y=0,yend=late,color="Late Flights"),size=1,col="red")+annotate("text",x=as.Date("2013-01-15"),y=500,label="February 2013 North American blizzard",size=6)+ ggtitle("Weekly Cycles",subtitle = "year=2013") +theme_bw() +theme(plot.title = element_text(size=50,hjust = 0.5),axis.title = element_text(size=30),axis.text = element_text(size=30),legend.text = element_text(size = 30),legend.title = element_blank(),axis.ticks = element_line(size=3,color = "black"),legend.position = "top",legend.direction = "vertical",plot.subtitle =element_text(size=25,hjust = 0.5))+ ylab("Flights per day") + scale_x_date(date_labels = "%b %d %y",date_breaks = "1 month") 

Question 3

nyfly <- nyfly %>% mutate(tot=1)
nyfly <- nyfly %>% mutate(late = ifelse(dep_delay>15,1,0))
nyfly <- na.omit(nyfly)
nyfly["time_int"] <- cut(nyfly$sched_dep_time, seq(0,2400,100), labels=seq(1,24,1))
latepercent2 <- nyfly %>% group_by(hour) %>% summarise(late=sum(late)/sum(tot))
latepercent2$times <- c("5:00-6:00","6:00-7:00","7:00-8:00","8:00-9:00","9:00-10:00","10:00-11:00","11:00-12:00","12:00-13:00","13:00-14:00","14:00-15:00","15:00-16:00","16:00-17:00","17:00-18:00","18:00-19:00","19:00-20:00","20:00-21:00","21:00-22:00","22:00-23:00","23:00-24:00")


hour_analysis<- ggplot(latepercent2, aes(x=hour,y=late)) + geom_bar(stat="identity",text=latepercent2$times,col="black",fill="blue") +
  ggtitle("Bar plot of late percetnage by hour of scheduled
 departure time in New York airports",subtitle = "Year=2013     City = New York") +  
  xlab("Hour of flight (in 24h format)") +theme_bw() +
  ylab("Late percentage (>15 minutes)") +coord_cartesian(xlim = c(5,24)) + scale_x_continuous(breaks = seq(5,23,1),labels = latepercent2$times)+ theme(axis.text.x = element_text(vjust = 0.5,angle = 90),plot.title = element_text(hjust = 0.5,size=10),plot.subtitle = element_text(hjust=0.5)) 




trafic_analysis <- nyfly %>% group_by(hour) %>% summarise(tot=sum(tot))
trafic_analysis$times <- c("5:00-6:00","6:00-7:00","7:00-8:00","8:00-9:00","9:00-10:00","10:00-11:00","11:00-12:00","12:00-13:00","13:00-14:00","14:00-15:00","15:00-16:00","16:00-17:00","17:00-18:00","18:00-19:00","19:00-20:00","20:00-21:00","21:00-22:00","22:00-23:00","23:00-24:00")
trafic_plot<-trafic_analysis %>% ggplot(aes(x=hour,y=tot)) + geom_bar(stat="identity",col="black",fill="blue",text=trafic_analysis$tot)+coord_cartesian(xlim = c(5,24))+ylab("Number of flights")+xlab("Hour of flight (in 24h format)") +
ggtitle("Bar plot of all flights by hour of scheduled departure time in New York airports",subtitle ="Year=2013     City = New York" ) + scale_x_continuous(breaks = seq(5,23,1),labels = latepercent2$times)+ theme_bw()+theme(axis.text.x = element_text(vjust = 0.5,angle = 90),plot.title = element_text(hjust = 0.5,size=10),plot.subtitle = element_text(hjust=0.5))



ggarrange(trafic_plot,hour_analysis)

Plot Caption:

We wanted to test whether there is a relation between the number of flights(=traffic) and the percentage of delayed departures.
In other words,we wanted to see whether heavy traffic incearsing the chances of a flight being delayed(by obvious reasons such as human errors,line for the departures lanes et cetera).
First, we grouped the data by hour of schedualed departure.
Then, we used barplots to visualize the number of flights in each hour(left plot) and the percentage of delayed flights in each hour(right plot).
As we can see in the figures above, the percentage of delayed flights in consistanly increasing as throughout the day, peaking at around 9-10 PM
(Note: there are no observation of flights that depature between 12PM to 5AM which could shed new light about the distribution of the late percentage around the day).
On the other hand, we can see that the amount of flights between 5AM and 12PM does not linearily increase thoughout the day as opposed to the late percentage figure.

Naivly enough, we can assume that the later it is, the more probable it is to be late.
Similarly to many real cases of life, there are probably more factors involved, meaning that the late percentage is affected by many reasons (some of them might be inconsistent such as different patterns during holidays et cetera), the weather, time of shifts change.
As we noted above, if we had observation of departures during 12PM-5AM, maybe the conclusion was different.

Note another interesting fact, since the data is for domestic flights, it doesn’t operate overnight, meaning the latest schedualed departure is at 12PM, thus could inflect the reason of increased late percentage as the day goes by (natural exhaustion).

Question 4

(1) Produce a graphic that tries to answer this question for the real data.

#sample1 <- sample(length(nyfly$month),size = 1000,replace = T)
late_month <- nyfly %>% group_by(month) %>% summarise(prop=sum(late),tot=sum(tot))
late_month$prop <- late_month$prop/late_month$tot
late_month %>% ggplot(aes(x=month,y=prop))+ geom_bar(stat="identity",col="black",fill="red") + scale_x_continuous(breaks = seq(1,12,1),labels = month.abb[late_month$month])+scale_y_continuous(breaks = seq(0,0.3,0.1),labels = c("0%","10%","20%","30%"))+ theme_bw()+ ggtitle("Proportion of delayed flights per month")+theme(plot.title = element_text(hjust=0.5))+ ylab("Proportion") + xlab("Month")  

(2) A graphic summarizing the percent of flights delayed, broken by destination Airport.

Note that in the figure, the 6th graph is the “real” graph from the data

set.seed(123)
sam_vec <- c()
for (i in 1:5) {
   sam_vec <-cbind(sam_vec,late_month$prop[sample(x = c(seq(1,12,1)),size = 12,replace = F)])
   
}
sam_vec <- data.frame(sam_vec,X6=late_month$prop,month=late_month$month)
sam_vec <- melt(sam_vec,id.vars = 'month', variable.name = 'sample')

sam_vec %>% ggplot(aes(x=month,y=value)) + geom_bar(stat="identity",col="black",fill="red") +facet_grid(~sample) + scale_x_continuous(breaks = seq(1,12,1),labels = month.abb[late_month$month])+scale_y_continuous(breaks = seq(0,0.3,0.1),labels = c("0%","10%","20%","30%"))+ theme_bw()+ ggtitle("Proportion of delayed flights per month")+theme(plot.title = element_text(hjust=0.5),axis.text.x = element_text(vjust = 0.5,angle = 90))+ ylab("Proportion") + xlab("Month") 

Explanation:
In this segment we were asked to check whether delayed-departure has a seasonal pattern?
Our null hypothesis is: • The proportion of delayed flights per month is independent across months.

These six plots show the The proportion of delayed flights per month.
Five of those plots are simulated under the null hypothesis that claims that
the proportion of delayed flights per month is independent across months and one of those plots made from a real dataset.
Under the null hypothesis the proportion of delays in month X is not
associated with delays in Month Y. Therefore, to simulate the data under the null hypothesis we took the original and randomly permute the values of the delayed propotion across the month to simulate the independent across months.

Is it easy to tell apart the real data from the simulated ones?
Partially.
At first it’s confusing to tell the difference between one plot and the other but when start to look for seasonal patterns in the data it becomes clearer.
In order to see the pattern we will try to smooth the bar plots into a continuous line to check if the pettern is more prominent.

sam_vec %>% ggplot(aes(x=month,y=value)) + geom_smooth() + facet_grid(~sample)+scale_x_continuous(breaks = seq(1,12,1),labels = month.abb[c(seq(1,12,1))]) +scale_y_continuous(breaks = seq(0,0.3,0.1),labels = c("0%","10%","20%","30%")) + theme_bw()+ ggtitle("Proportion of delayed flights per month")+theme(plot.title = element_text(hjust=0.5),axis.text.x = element_text(vjust = 0.5,angle = 90))+ ylab("Proportion") + xlab("Month") 


Here we can see a much more significant difference between the simulation plots and the real dataset plot.
The real dataset point has a clear “N” shape and the smallest variance (grey area) which implys presence of pattern.
But still even after smoothing the bar plot we can still have simulations that look similer to the real data plot.

Form this process it is possible to learn that visual patterns can be
distinguish between from random patterns, however it is not always easy to
notice and it might be useful to try another type of plots.
The results from this hypothesis testing are not clear.
On the one hand we can see simulations that look similer to the real data
plot the so we can not reject the null hypothesis based on this test.
On the other hand we do not know the level of significance of this test and
can easly make a type 2 error.
Therefore we may get a more unequivocal result if we use numeric tools